SISMID Tidyverse Module 3: dplyr verbs for transforming data frames

Learning Objectives

  1. Filter and reorder rows in a data frame
  2. Create, reorder, and rename columns
  3. Calculate summary statistics for data groups

Example datasets:

nycflights13 flight data
Includes data on all departing flights from JFK, LGA, and EWR for all of 2013

nycflights13 weather data
Includes hourly weather data at JFK, LGA, and EWR for all of 2013

Start by running this code (chunk: load packages)

library(tidyverse)
library(nycflights13)

flights <- nycflights13::flights
weather <- nycflights13::weather

Outline

  1. Rows
  2. Columns
  3. Groups

Use filter() to keep or omit specific rows

Artwork by @allison_horst
Artwork by @allison_horst

filter() is equivalent to subsetting or indexing in base R

filter() in action

Let’s say we only want data for a specific airport from the flights data set
filter() goes through each row/observation and checks to see if it meets the criteria you’ve specified
If yes, that row/observation remains in the resulting data frame; if no, that row/observation is removed

What is this line of code doing? (chunk: filter 1)

jfk_flights <- filter(flights, origin == "JFK")

In this example, filter is keeping all rows/observations that have “JFK” in the origin column and removing all others

filter() in action

We can do the same with numeric variables

What is this line of code doing? (chunk: filter 1)

nyc_am_flights <- filter(flights, dep_time < 1200)

Here, we want to keep only those flights that departed in the morning

filter(), cont.

Options for filtering:

filter()

How do we complete these lines of code to filter to all major airports in Texas (IAH, DFW, HOU)? (chunk: filter 2)

flights %>% filter(dest == "IAH", [ADD CODE HERE])
flights %>% filter(dest %in% c("IAH", [ADD CODE HERE]))

Both of these filter statements accomplish the same thing (chunk: filter 3)

flights %>% filter(dest %in% c("IAH", "DFW", "HOU"))

#equivalent to
flights %>% filter(dest == "IAH" | dest == "DFW" | dest == "HOU")

slice()

slice() is similar to filter(), but you select rows based on their position in the dataset rather than by some condition

What is this line of code doing? (chunk slice 1)

flights %>% slice(500:510)

This code pulls out rows 500 - 510 in the flights dataset

slice()

You can use slice_head() and slice_tail() to replace base R’s head() and tail() functions and specify the number of rows you want to see

Complete these lines of code to view the first 6 rows and the last 12 rows of flight data (chunk: slice 2)

flights %>% slice_head([ADD CODE HERE])
flights %>% slice_tail([ADD CODE HERE])
flights %>% slice_head(n = 6)
flights %>% slice_tail(n = 12)

arrange()

arrange() allows us to sort/order our dataset by specific variables

What is this line of code doing? (chunk: arrange 1)

flights %>% arrange(dep_time)

This code arranges flights by their departure time, from smallest to largest

arrange()

The default direction of sorting is ascending, but we can use desc() for descending

How can we alter this code to arrange flights by departure time, largest to smallest? (chunk: arrange 2)

flights %>% arrange(desc([ADD CODE HERE]))
flights %>% arrange(desc(dep_time))

arrange()

You can also select multiple variables in the order you want to sort them

Write a statement that arranges the flight data by arrival time in descending order, departure time in ascending order, and day in ascending order (chunk: arrange 3)

flights %>% arrange([ADD CODE HERE])
flights %>% arrange(desc(arr_time), dep_time, day)

EXERCISE 1.1

Use pipes + what we’ve just learned to answer the following

  1. How many flights departed from JFK on December 3rd? (chunk: practice 1.1)
flights %>% filter([ADD CODE HERE])
flights %>% filter(origin == "JFK" & month == 12 & day == 3)

ANSWER: 295

EXERCISE 1.2

Use pipes + what we’ve just learned to answer the following

  1. How many flights went to ATL in March? What was the shortest flight time? (chunk: practice 1.2)
flights %>% [ADD CODE HERE] %>%
            arrange([ADD CODE HERE])
flights %>% filter(dest == "ATL" & month == 3) %>%
            arrange(air_time)

ANSWER: 1,448

EXERCISE 1.3

Use pipes + what we’ve just learned to answer the following

  1. How many flights are missing arrival delay data? (chunk: practice 1.3)
flights %>% [ADD CODE HERE](is.na([ADD CODE HERE]))
flights %>% filter(is.na(arr_delay))

ANSWER: 9,430

EXERCISE 1.4

Use pipes + what we’ve just learned to answer the following

  1. How many flights leaving LGA reported a departure delay over 30 minutes? (chunk: practice 1.4)
flights %>% [ADD CODE HERE]
flights %>% filter(dep_delay > 30 & origin == "LGA")

ANSWER: 13,136

EXERCISE 1.5

Use pipes + what we’ve just learned to answer the following

  1. What was the destination of the flight that had the smallest arrival delay, but traveled the longest distance? (chunk: practice 1.5)
flights %>% [ADD CODE HERE]
flights %>% arrange(arr_delay, desc(distance))


ANSWER: SFO

Columns

Pick specific columns to work with (or exclude) with select()

select() is similar to filter(), but for columns instead of rows
You can select columns by name, position, or other characteristics

Complete this code to select the origin, destination, and distance columns (chunk: select 1)

flights %>% select([ADD CODE HERE])
flights %>% select(origin, dest, distance)

select()

Complete this code to select the columns 13, 6, 2, and 8 (chunk: select 2)

flights %>% select([ADD CODE HERE]
flights %>% select(13, 6, 2, 8)

select()

Complete this code to select all columns except tail number and time/hour (chunk: select 3)

flights %>% select(-[ADD CODE HERE], -[ADD CODE HERE])
flights %>% select(-tailnum, -time_hour)

select()

What is this code doing? (chunk: select 4)

flights %>% select(contains("dep"))

This code selects columns that contain the string “dep”

Make new variables with mutate()

mutate() is a powerhouse function for data manipulation - it allows you to make new variables
We’ll cover mutate() in a basic sense now and delve into it more in the next module

Artwork by @allison_horst
Artwork by @allison_horst

mutate()

You can use mutate() to create or calculate new variables from existing variables

What is this code doing? (chunk: mutate 1)

flights %>% mutate(speed = distance/air_time*60)

Here we’re creating a new variable called speed that’s calculated by dividing values in the distance column by values in the air_time column

The new column is added to the end of the dataset

We’ll be learning more about functions to use with mutate() in the next module!

mutate()

Write a line of code that creates a new column called kilometers, which is calculated by multiplying the distance in miles by 1.609 (chunk: mutate 2)

flights %>% mutate([ADD CODE HERE])
flights %>% mutate(kilometers = distance * 1.609)

Shuffle around and reorder columns in a dataset with relocate()

Artwork by @allison_horst
Artwork by @allison_horst

relocate()

We can use relocate() to move one or more columns to a different position in our dataset, specifying which column they should go before or after

Complete this line of code to relocate any columns with the string “wind” to appear after the precipitation column (chunk: relocate 1)

weather %>% relocate(contains([ADD CODE HERE]), .after = [ADDS CODE HERE])
weather %>% relocate(contains("wind"), .after = precip)

Use rename() to rename columns in your dataset

rename() is exactly what is sounds like - it renames columns/variables in a dataset
Even if we use clean_names(), there may still be some changes we want or need to make to certain column names

Artwork by @allison_horst
Artwork by @allison_horst

rename()

The argument structure is new_name = old_name

Complete this code to rename the dewp column to dew point (chunk: rename 1)

weather %>% rename([ADD CODE HERE] = [ADD CODE HERE])
weather %>% rename(dew_point = dewp)

EXERCISE 2.1

Use pipes + what we’ve learned so far to answer or accomplish the following

  1. Make a table that contains data on flight arrivals to SFO from EWR for the 15th day of each month (chunk: exercise 2.1)
flights %>% filter(origin == [ADD CODE HERE] & dest == [ADD CODE HERE] & [ADD CODE HERE]) %>%
            select(contains([ADD CODE HERE]))
flights %>% filter(origin == "EWR" & dest == "SFO" & day == 15) %>%
            select(contains("arr"))

EXERCISE 2.2

Use pipes + what we’ve learned so far to answer or accomplish the following

  1. Make a table of weather observations where the temperature exceeded 90 degrees, but with only origin, date/time, and temp variables (chunk: exercise 2.2)
weather %>% [ADD CODE HERE] %>%
            select([ADD CODE HERE])
weather %>% filter(temp > 90) %>%
            select(origin:temp)

EXERCISE 2.3

Use pipes + what we’ve learned so far to answer or accomplish the following

  1. Make a table of flights that departed early and arrived early,
    shows only departure delay, arrival delay, and distance with the largest arrival delay first,
    create a new column called total_delay that adds together the departure and arrival delays and put it in front of the distance column, and
    rename the departure and arrival delay columns to fully spell out departure/arrival (chunk: exercise 2.3)
flights %>% filter([ADD CODE HERE]) %>%
            mutate(total_delay = [ADD CODE HERE]) %>%
            select([ADD CODE HERE]) %>%
            rename([ADD CODE HERE]) %>%
            [ADD CODE HERE]
flights %>% filter(dep_delay < 0 & arr_delay < 0) %>%
            mutate(total_delay = dep_delay + arr_delay) %>%
            select(dep_delay, arr_delay, total_delay, distance) %>%
            rename(departure_delay = dep_delay, arrival_delay = arr_delay) %>%
            arrange(arrival_delay)

Groups

Create data groups with group_by()

group_by() lets us group our dataset by one or more variables or columns
This allows us to then calculate summary statistics for specific groups

For example, if we have a dataset containing the height and age of different trees,
we can group by the age variable and calculate the mean height for each age group

rforecology.com
rforecology.com

group_by() always needs to operate with summarize(), unless you want to summarize your whole dataset down to a single value

Calculate data summaries for groups with summarize()

summarize() is similar to mutate() but operates on grouped data
This allows us to do things like calculate a mean for each group or get the number of observations for each group
group_by() and summarize() are always used together

What is this code doing? (chunk: group summarize 1)

weather %>% group_by(origin, month) %>%
            summarize(mean_temp = mean(temp, na.rm = TRUE))

Here we are grouping by both origin and month, then using summarize() to create a new column that calculates the mean temperature for each airport each month

summarize()

Complete this code to create a table that counts the number of flights from each origin (chunk: group summarize 2)

flights %>% group_by([ADD CODE HERE]) %>%
            summarize(count = [ADD CODE HERE])
flights %>% group_by(origin) %>%
            summarize(count = n())

We’ll be learning more about functions to use with summarize() in the next module!

EXERCISE 3.1

Use pipes + what we’ve learned so far to answer or accomplish the following

  1. How many flights departed on the busiest day of the year for each airport? (chunk: exercise 3.1)

[HINT: check out the function top_n()]

flights %>% group_by(origin, [ADD CODE HERE]) %>%
            summarize(count = [ADD CODE HERE]) %>%
            group_by([ADD CODE HERE]) %>%
            top_n([ADD CODE HERE])
flights %>% group_by(origin, month, day) %>%
            summarize(count = n()) %>%
            group_by(origin) %>%
            top_n(1)

EXERCISE 3.2

Use pipes + what we’ve learned so far to answer or accomplish the following

  1. What was the coldest day of the year for each airport? How many flights departed each airport on that day? (chunk: exercise 3.2)
weather %>% group_by([ADD CODE HERE], month, day) %>%
            summarize(min_temp = [ADD CODE HERE]) %>%
            group_by([ADD CODE HERE]) %>%
            filter(min_temp == [ADD CODE HERE])

flights %>% filter(day == [ADD CODE HERE] & month == [ADD CODE HERE]) %>%
            group_by([ADD CODE HERE]) %>%
            summarize(count = [ADD CODE HERE])
weather %>% group_by(origin, month, day) %>%
            summarize(min_temp = min(temp, na.rm = TRUE)) %>%
            group_by(origin) %>%
            filter(min_temp == min(min_temp, na.rm = TRUE))

flights %>% filter(day == 23 & month == 1) %>%
            group_by(origin) %>%
            summarize(count = n())

EXERCISE 3.3

Use pipes + what we’ve learned so far to answer or accomplish the following

  1. Which destination airport does each NYC airport fly to most often? (chunk: exercise 3.3)
flights %>% group_by([ADD CODE HERE]) %>%
            summarize([ADD CODE HERE]) %>%
            [ADD CODE HERE]
flights %>% group_by(origin, dest) %>%
            summarize(count = n()) %>%
            top_n(1)

EXERCISE 3.4

Use pipes + what we’ve learned so far to answer or accomplish the following

  1. What is the total monthly rainfall at each NYC airport in 2023? Which month is the rainiest? (chunk: exercise 3.4)
[ADD CODE HERE] %>% group_by([ADD CODE HERE]) %>%
                    summarize(total_prcp = [ADD CODE HERE]) %>%
                    pivot_wider(names_from = [ADD CODE HERE], values_from = [ADD CODE HERE])
weather %>% group_by(origin, month) %>%
            summarize(total_prcp = sum(precip, na.rm = TRUE)) %>%
            pivot_wider(names_from = origin, values_from = total_prcp)

EXERCISE 3.5

Use pipes + what we’ve learned so far to answer or accomplish the following

  1. What is the longest recorded flight departure delay for each airport? (chunk: exercise 3.5)
flights %>% [ADD CODE HERE] %>%
            summarize([ADD CODE HERE])
flights %>% group_by(origin) %>%
            summarize(max_delay = max(dep_delay, na.rm = TRUE))

EXERCISE 3.6

Use pipes + what we’ve learned so far to answer or accomplish the following

  1. How many flights departed more than 30 minutes late from each airport? Which carrier was most commonly delayed more than 30 minute? (chunk: exercise 3.6)
flights %>% [ADD CODE HERE] %>%
            group_by([ADD CODE HERE]) %>%
            summarize([ADD CODE HERE])

flights %>% [ADD CODE HERE] %>%
            group_by([ADD CODE HERE]) %>%
            summarize([ADD CODE HERE]) %>%
            [ADD CODE HERE]
flights %>% filter(dep_delay > 30) %>%
            group_by(origin) %>%
            summarize(count = n())

flights %>% filter(dep_delay > 30) %>%
            group_by(origin, carrier) %>%
            summarize(count = n()) %>%
            top_n(1)

Revisiting our learning objectives

  1. Filter and reorder rows in a data frame
  2. Create, reorder, and rename columns
  3. Calculate summary statistics for data groups

Packages: dplyr
Data transformation with dplyr cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf